This exercise involves the UN data set from ALR. Install the alr4 package and load the data to answer the following questions, adding your code in the code chunks. Add appropriate chunk options to suppress messages and warnings once you are sure the code is working properly, and remove instructions that are no longer needed. Switch the output to pdf for the final version you upload to Sakai.
summary(UN3)
## ModernC Change PPgdp Frate
## Min. : 1.00 Min. :-1.100 Min. : 90 Min. : 2.00
## 1st Qu.:19.00 1st Qu.: 0.580 1st Qu.: 479 1st Qu.:39.50
## Median :40.50 Median : 1.400 Median : 2046 Median :49.00
## Mean :38.72 Mean : 1.418 Mean : 6527 Mean :48.31
## 3rd Qu.:55.00 3rd Qu.: 2.270 3rd Qu.: 8461 3rd Qu.:58.00
## Max. :83.00 Max. : 4.170 Max. :44579 Max. :91.00
## NA's :58 NA's :1 NA's :9 NA's :43
## Pop Fertility Purban
## Min. : 2.3 Min. :1.000 Min. : 6.00
## 1st Qu.: 767.2 1st Qu.:1.897 1st Qu.: 36.25
## Median : 5469.5 Median :2.700 Median : 57.00
## Mean : 30281.9 Mean :3.214 Mean : 56.20
## 3rd Qu.: 18913.5 3rd Qu.:4.395 3rd Qu.: 75.00
## Max. :1304196.0 Max. :8.000 Max. :100.00
## NA's :2 NA's :10
for (i in 1:length(UN3)) {
  print(c(colnames(UN3)[i], is.numeric(UN3[, i])))
}
## [1] "ModernC" "TRUE"
## [1] "Change" "TRUE"
## [1] "PPgdp" "TRUE"
## [1] "Frate" "TRUE"
## [1] "Pop" "TRUE"
## [1] "Fertility" "TRUE"
## [1] "Purban" "TRUE"
Answer: six variables have missing data (ModernC, Change, PPgdp, Frate, Pop, and Fertility); only Purban is complete. Every variable prints "TRUE" for is.numeric, so all variables are quantitative.
mean_stan <- matrix(data = NA, nrow = 6, ncol = 3)
m <- 1
for (i in 1:(length(UN3) - 1)) {
  mean_stan[m, ] <- c(colnames(UN3)[i], mean(na.omit(UN3[, i])), sd(na.omit(UN3[, i])))
  m <- m + 1
}
stats.data <- data.frame(mean_stan)
colnames(stats.data) <- c("name", "mean", "sd")
knitr::kable(stats.data)
| name | mean | sd |
|---|---|---|
| ModernC | 38.7171052631579 | 22.6366103759673 |
| Change | 1.41837320574163 | 1.13313267030361 |
| PPgdp | 6527.38805970149 | 9325.18855244529 |
| Frate | 48.3053892215569 | 16.5324480416909 |
| Pop | 30281.8714278846 | 120676.694478229 |
| Fertility | 3.214 | 1.70691793716661 |
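The loop above mirrors `mean(na.omit(x))` and `sd(na.omit(x))` for each column. As a language-neutral illustration of the same missing-value handling, here is a minimal Python sketch (toy values, not the UN3 data); note that R's `sd()` uses the n − 1 sample denominator, which the sketch reproduces:

```python
import math

def mean_sd_na_omit(values):
    """Drop missing entries (None), then return (mean, sample sd).

    Mirrors mean(na.omit(x)) and sd(na.omit(x)) in R: the sd uses
    the n - 1 (sample) denominator, as R's sd() does.
    """
    x = [v for v in values if v is not None]
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m, math.sqrt(var)

# Hypothetical column with one missing value
m, s = mean_sd_na_omit([1.0, 2.0, None, 3.0, 4.0])
```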
3. Investigate the marginal relationships of ModernC with the other variables using a scatterplot matrix. Are there potential outliers, nonlinear relationships or transformations that appear to be needed?
library(GGally)
gg <- ggpairs(na.omit(UN3), columns = c(2:7, 1))
gg
Answer: it seems that Fertility, Purban, PPgdp, and Change are useful in predicting ModernC (correlation coefficients > 0.5).
## Model Fitting
4. Use the lm() function to perform a multiple linear regression with ModernC as the response and all other variables as the predictors, using the formula ModernC ~ ., where the . includes all remaining variables in the data frame. Create diagnostic residual plots from the linear model object and comment on the results regarding model assumptions.
g <- lm(ModernC ~ ., data = UN3)
plot(g)
Answer: the normality assumption appears violated: the points in the normal Q-Q plot do not follow the 45-degree line, indicating lighter tails than normal.
The other plots look reasonable, except the scale-location plot (third panel), which shows some non-random pattern.
5. Using boxTidwell from library car or graphical methods, find appropriate transformations of the predictor variables to be used as predictors in the linear model. If any predictors are negative, you may need to transform so that they are non-negative. Describe your method and the resulting transformations.
Change takes negative values (minimum −1.1), so we first shift it by 5 to make it strictly positive before estimating power transformations:
UN3$Change_pos <- UN3$Change + 5
UN3_NA<-na.omit(UN3)
m<-cbind(UN3_NA$ModernC,UN3_NA$Change_pos,UN3_NA$PPgdp,UN3_NA$Frate,UN3_NA$Pop,UN3_NA$Fertility,UN3_NA$Purban)
k<-powerTransform(m)
k
## Estimated transformation parameters
## Y1 Y2 Y3 Y4 Y5 Y6
## 0.8689531 0.9553174 -0.1565315 1.0897567 0.0628630 0.2027349
## Y7
## 0.9285349
UN3_NAA<-UN3_NA
UN3_NAA$Change_pos<-NULL
k_bcn<-powerTransform(as.matrix(UN3_NAA[,-1])~.,family="bcnPower",data=UN3_NAA)
## Warning in sqrt(diag(res$invHess[1:d, 1:d, drop = FALSE])): NaNs produced
## Warning in sqrt(diag(res$invHess[(d + 1):(2 * d), (d + 1):(2 * d), drop =
## FALSE])): NaNs produced
k_bcn
## Estimated transformation power, lambda
## [1] 0.2951946 0.9999996 0.9999975 0.3251072 0.9999648 0.9999841
## Estimated transformation location, gamma
## [1] 4.658143e+00 1.212076e+00 2.844007e-02 1.304196e+06 2.294808e-02
## [6] 1.428396e-01
Here we can see that the estimated powers for PPgdp, Pop, and Fertility are near 0, which suggests log transformations. We then look at their term plots to see whether they really need transforming.
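A Box-Cox power estimate near zero corresponds to a log transformation, since (x^λ − 1)/λ → log x as λ → 0. A minimal self-contained Python check of that limit (arbitrary positive values, not the UN3 data):

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform: (x^lam - 1)/lam, with log(x) at lam = 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# For a small lambda the transform is numerically close to log(x)
vals = [0.5, 2.0, 100.0, 6527.0]
gaps = [abs(box_cox(v, 1e-6) - math.log(v)) for v in vals]
```

This is why a `powerTransform` estimate close to 0 is conventionally rounded to the log scale rather than used as a fractional power.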
#z<-lm(ModernC~Change_pos+log(PPgdp)+Frate+log(Pop)+log(Fertility)+Purban,data=UN3_NA)
termplot(lm(ModernC ~.-Change,data=UN3_NA),terms="PPgdp",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change-PPgdp+log(PPgdp),data=UN3_NA),terms="log(PPgdp)",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change,data=UN3_NA),terms="Pop",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change-Pop+log(Pop),data=UN3_NA),terms="log(Pop)",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change,data=UN3_NA),terms="Fertility",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change-Fertility+log(Fertility),data=UN3_NA),terms="log(Fertility)",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
Comparing each pair of term plots, we can see that:
1. It is hard to tell whether log(PPgdp) fits better than PPgdp.
2. log(Pop) appears, if anything, worse than Pop.
3. Leaving Fertility untransformed looks better than using log(Fertility).
We then look at added-variable plots to decide whether the transformations should be used.
mod1 = lm(ModernC ~ .-Change, data=UN3_NA)
avPlots(mod1)
mod2 = lm(ModernC ~ .-Change-Pop+log(Pop), data=UN3_NA)
avPlots(mod2)
mod3 = lm(ModernC ~ .-Change-Pop-PPgdp+log(PPgdp)+log(Pop), data=UN3_NA)
avPlots(mod3)
mod4=lm(ModernC ~ .-Change-Pop-PPgdp-Fertility+log(PPgdp)+log(Pop)+log(Fertility), data=UN3_NA)
avPlots(mod4)
By looking at and comparing these added-variable plots, we can see that transforming PPgdp and Pop to the log scale is warranted, while there is no need to transform Fertility.
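The added-variable plots are informative here because of a standard fact (Frisch-Waugh-Lovell): the slope in the added-variable plot for a predictor equals that predictor's coefficient in the full multiple regression. A self-contained Python sketch with made-up, centered toy data (not the UN3 variables) illustrates this:

```python
def no_intercept_slope(x, y):
    """Slope of y on x through the origin: sum(xy) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

# Toy data, centered so the intercept drops out of every regression
x1 = center([1.0, 2.0, 3.0, 4.0, 6.0])
x2 = center([2.0, 1.0, 4.0, 3.0, 5.0])
y  = center([2.3, 2.9, 6.1, 6.8, 10.2])

# Multiple regression of y on (x1, x2) via the 2x2 normal equations
s11 = sum(a * a for a in x1); s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(a * a for a in x2)
t1 = sum(a * b for a, b in zip(x1, y)); t2 = sum(a * b for a, b in zip(x2, y))
det = s11 * s22 - s12 * s12
beta1 = (s22 * t1 - s12 * t2) / det

# Added-variable construction: residualize y and x1 on x2, then regress
g = no_intercept_slope(x2, y)
d = no_intercept_slope(x2, x1)
e_y  = [a - g * b for a, b in zip(y, x2)]
e_x1 = [a - d * b for a, b in zip(x1, x2)]
av_slope = no_intercept_slope(e_x1, e_y)   # equals beta1
```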
6. Given the selected transformations of the predictors, select a transformation of the response and justify.
y_trans<-lm(ModernC~Frate+Fertility+Purban+Change+log(PPgdp)+log(Pop), data=UN3_NA)
boxcox(y_trans)
Answer: the profile log-likelihood peaks at lambda around 1, so we do not need to transform the response.
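λ ≈ 1 means the Box-Cox transform of the response is essentially y − 1, an affine shift that changes only the intercept while leaving slopes and residuals unchanged. A small Python check of that point with made-up data:

```python
def ols_slope_intercept(x, y):
    """Simple-regression OLS fit of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b1, b0 = ols_slope_intercept(x, y)
# Shifting the response down by 1 (the lambda = 1 Box-Cox transform)
b1_shift, b0_shift = ols_slope_intercept(x, [v - 1.0 for v in y])
# Slope is unchanged; only the intercept moves down by 1
```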
plot(y_trans)
From the residual plots, the regression model with transformed predictors looks better than the original one, although the normal Q-Q plot still shows lighter-than-normal tails.
test<-matrix(data=NA, nrow = 6, ncol = 3)
cc<-summary(y_trans)
for (i in 2:length(coefficients(y_trans)))
{
test[i-1,]<-c(rownames(cc$coefficients)[i],confint(y_trans, rownames(cc$coefficients)[i], level=0.95))
}
ci_data<-data.frame(test)
colnames(ci_data)<-c("Var name","2.5%","97.5%")
knitr::kable(ci_data)
| Var name | 2.5% | 97.5% |
|---|---|---|
| Frate | 0.0366942925093925 | 0.342092853921895 |
| Fertility | -13.1723343098266 | -6.17954853020956 |
| Purban | -0.264039098032114 | 0.122503127036324 |
| Change | 0.879749609153315 | 9.10616508489123 |
| log(PPgdp) | 2.72490389578725 | 8.28965295253567 |
| log(Pop) | 0.226969886316818 | 2.71717883733202 |
avPlots(y_trans,id.n=1)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="log(Pop)",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="Fertility",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="Frate",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="Change",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="log(PPgdp)",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
termplot(lm(ModernC ~.-Change_pos-Pop-PPgdp+log(Pop)+log(PPgdp),data=UN3_NA),terms="Purban",partial.resid = T, se=T, rug=T,smooth = panel.smooth)
Answer: there are two localities with strong influence on the Pop term: China and India.
9. Are there any outliers in the data? Explain. If so, refit the model after removing any outliers.
Answer: I don’t think so. In the fourth diagnostic plot (residuals vs. leverage, with Cook’s distance contours), no point has a Cook’s distance large enough to materially change the fit.
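Cook's distance measures exactly this kind of influence: D_i = (e_i² / (p σ̂²)) · h_ii / (1 − h_ii)², which equals (β̂ − β̂_(i))ᵀ XᵀX (β̂ − β̂_(i)) / (p σ̂²), the scaled shift in the coefficients when case i is deleted. A pure-Python sketch on a toy intercept-plus-slope fit (made-up data, not the UN3 model) checking that the two definitions agree:

```python
def fit2(rows, y):
    """OLS beta for a 2-column design via the 2x2 normal equations."""
    a = sum(r[0] * r[0] for r in rows); b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    u = sum(r[0] * v for r, v in zip(rows, y))
    w = sum(r[1] * v for r, v in zip(rows, y))
    det = a * d - b * b
    return [(d * u - b * w) / det, (a * w - b * u) / det]

X = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 5.0)]
y = [1.2, 1.9, 3.2, 4.8]
n, p = len(X), 2
beta = fit2(X, y)
resid = [v - (beta[0] * r[0] + beta[1] * r[1]) for r, v in zip(X, y)]
s2 = sum(e * e for e in resid) / (n - p)          # sigma-hat squared

a = sum(r[0] * r[0] for r in X); b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X); det = a * d - b * b

def leverage(xi):
    """h_ii = x_i^T (X^T X)^{-1} x_i via the explicit 2x2 inverse."""
    return (d * xi[0] ** 2 - 2 * b * xi[0] * xi[1] + a * xi[1] ** 2) / det

# Cook's distance two ways for each case i
cooks_formula, cooks_refit = [], []
for i in range(n):
    h = leverage(X[i])
    cooks_formula.append(resid[i] ** 2 * h / (p * s2 * (1.0 - h) ** 2))
    bi = fit2(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
    db = [beta[0] - bi[0], beta[1] - bi[1]]
    quad = a * db[0] ** 2 + 2 * b * db[0] * db[1] + d * db[1] ** 2
    cooks_refit.append(quad / (p * s2))
```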
\[
( X^T_{(i)}X_{(i)})^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T (X^TX)^{-1}}{1 - h_{ii}}
\]
where \(h_{ii}\) is the \(i\)th diagonal element of \(H = X(X^TX)^{-1}X^T\).
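Before proving the identity, the ingredients can be sanity-checked numerically. A pure-Python sketch (toy 2-column design matrix, made-up values) computes h_ii directly from its definition and confirms the standard fact that the leverages sum to p, the number of columns of X:

```python
def inv2(a, b, c, d):
    """Inverse of the 2x2 matrix [[a, b], [c, d]], returned entrywise."""
    det = a * d - b * c
    return d / det, -b / det, -c / det, a / det

# Toy design matrix: intercept column plus one predictor
X = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 5.0)]

# X^T X for a two-column X
s11 = sum(r[0] * r[0] for r in X)
s12 = sum(r[0] * r[1] for r in X)
s22 = sum(r[1] * r[1] for r in X)
ia, ib, ic, id_ = inv2(s11, s12, s12, s22)

# h_ii = x_i^T (X^T X)^{-1} x_i for each row x_i
h = [xi[0] * (ia * xi[0] + ib * xi[1]) + xi[1] * (ic * xi[0] + id_ * xi[1])
     for xi in X]
```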
Start from the identity we want to show, (1): \[( X^T_{(i)}X_{(i)})^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T (X^TX)^{-1}}{1 - h_{ii}}\] Multiplying both sides of (1) on the left by \(X^TX\) and scaling by \((1-h_{ii})\), it suffices to show \[(X^TX)( X^T_{(i)}X_{(i)})^{-1}(1-h_{ii}) = (X^TX)(X^TX)^{-1}(1-h_{ii}) + (X^TX)(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}\]
Substituting \(X^TX = X^T_{(i)}X_{(i)} + x_ix_i^T\) on the left-hand side,
\[(X^T_{(i)}X_{(i)}+x_i x_i^T)( X^T_{(i)}X_{(i)})^{-1}(1-h_{ii})=I(1-h_{ii})+x_ix_i^T(X^TX)^{-1}\]
\[I(1-h_{ii})+x_i x_i^T( X^T_{(i)}X_{(i)})^{-1}(1-h_{ii})=I(1-h_{ii})+x_ix_i^T(X^TX)^{-1}\]
Cancelling \(I(1-h_{ii})\) and multiplying both sides on the right by \(X^T_{(i)}X_{(i)} = X^TX - x_ix_i^T\), the equation becomes \[x_i x_i^T(1-h_{ii})=x_ix_i^T(X^TX)^{-1}(X^TX-x_ix_i^T)\]
\[x_ix_i^T(1-h_{ii})=x_ix_i^T(I-(X^TX)^{-1}x_ix_i^T)\] \[x_ix_i^Th_{ii}=x_ix_i^T(X^TX)^{-1}x_ix_i^T\] Notice that \(h_{ii}=x_i^T(X^TX)^{-1}x_i\) is a scalar, so the right-hand side equals \(x_ih_{ii}x_i^T = x_ix_i^Th_{ii}\), and the last equation holds.
Since every step is reversible, starting from this equation and going back proves \[( X^T_{(i)}X_{(i)})^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T (X^TX)^{-1}}{1 - h_{ii}}\]
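The identity can also be verified numerically: delete a row, invert \(X^T_{(i)}X_{(i)}\) directly, and compare with the rank-one (Sherman-Morrison) update. A pure-Python sketch using a toy 2-column design (made-up values):

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 5.0)]
xtx = [[sum(r[p] * r[q] for r in X) for q in range(2)] for p in range(2)]
A = inv2(xtx)                      # (X^T X)^{-1}

i = 3                              # row to delete (0-based)
xi = X[i]
h_ii = sum(xi[p] * sum(A[p][q] * xi[q] for q in range(2)) for p in range(2))

# Direct inverse after deleting row i: X_(i)^T X_(i) = X^T X - x_i x_i^T
xtx_i = [[xtx[p][q] - xi[p] * xi[q] for q in range(2)] for p in range(2)]
direct = inv2(xtx_i)

# Rank-one update: A + A x_i x_i^T A / (1 - h_ii)
Ax = [sum(A[p][q] * xi[q] for q in range(2)) for p in range(2)]
update = [[A[p][q] + Ax[p] * Ax[q] / (1.0 - h_ii) for q in range(2)]
          for p in range(2)]
```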
\[\hat{\beta}_{(i)} = \hat{\beta} - \frac{(X^TX)^{-1}x_i e_i}{1 - h_{ii}}\]
where \(\hat{\beta}_{(i)} = ( X^T_{(i)}X_{(i)})^{-1} X_{(i)}^T Y_{(i)}\) and \(e_i = y_i - x_i^T\hat{\beta}\). Hint: write \(X_{(i)}^T Y_{(i)} = X^TY - x_{i}y_{i}\).
\[\hat{\beta}_{(i)} = ( X^T_{(i)}X_{(i)})^{-1} X_{(i)}^T Y_{(i)}\]
Using the identity from the previous part together with the hint,
\[\hat{\beta}_{(i)} = \left[(X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T (X^TX)^{-1}}{1 - h_{ii}}\right](X^TY-x_iy_i)\]
\[=(X^TX)^{-1}X^TY+(X^TX)^{-1}\frac{x_ix_i^T (X^TX)^{-1}X^TY-x_iy_i(1-h_{ii})-x_ix_i^T (X^TX)^{-1}x_iy_i}{1-h_{ii}}\]
Since \(X^TY = X^TX\hat{\beta}\) and \(x_i^T(X^TX)^{-1}x_i = h_{ii}\),
\[=\hat{\beta}+\frac{(X^TX)^{-1}}{1-h_{ii}}\left[x_ix_i^T (X^TX)^{-1}X^TX\hat{\beta}-x_iy_i+x_iy_ih_{ii}-x_ih_{ii}y_i\right]\] \[=\hat{\beta}+\frac{(X^TX)^{-1}}{1-h_{ii}}\left[x_ix_i^T\hat{\beta}-x_iy_i\right]\] \[=\hat{\beta}-\frac{(X^TX)^{-1}x_ie_i}{1-h_{ii}}\]
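This case-deletion formula can likewise be checked numerically: refit without row i and compare against the closed form. A pure-Python sketch on a toy intercept-plus-slope fit (made-up data):

```python
def fit2(rows, y):
    """OLS beta = (X^T X)^{-1} X^T y for a 2-column X, via the 2x2 inverse."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    u = sum(r[0] * v for r, v in zip(rows, y))
    w = sum(r[1] * v for r, v in zip(rows, y))
    det = a * d - b * b
    return [(d * u - b * w) / det, (a * w - b * u) / det]

X = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 5.0)]
y = [1.2, 1.9, 3.2, 4.8]
beta = fit2(X, y)

i = 3
xi, yi = X[i], y[i]
e_i = yi - (beta[0] * xi[0] + beta[1] * xi[1])     # e_i = y_i - x_i^T beta

# (X^T X)^{-1} x_i and h_ii, reusing the normal-equation sums
a = sum(r[0] * r[0] for r in X); b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X); det = a * d - b * b
Axi = [(d * xi[0] - b * xi[1]) / det, (a * xi[1] - b * xi[0]) / det]
h_ii = xi[0] * Axi[0] + xi[1] * Axi[1]

# Closed form vs. an actual refit with row i removed
closed = [beta[k] - Axi[k] * e_i / (1.0 - h_ii) for k in range(2)]
refit = fit2(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
# For this toy data both come out to approximately [0.1, 1.0]
```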